2011 WIDE Crawl (wide00002)



Crawl start date: 09 March, 2011

Crawl end date: 23 December, 2011

Number of captures: 2,713,676,341

Number of unique URLs captured: 2,273,840,159

Number of hosts captured: 29,032,069

HTTP status codes (chart data)

MIME Types (chart data)

Hosts Crawled per Top-Level-Domain (chart data)

URLs crawled per Top-Level-Domain (chart data)

Link Extraction

Links were extracted from all HTML documents in the collection and grouped into 'a href' links and non 'a href' links (embeds). To determine how many URLs remain to be crawled, we compared the list of all extracted links with the list of URLs already crawled.
Note: Heritrix crawl logs are not available for this collection, so we could not document how many URLs were skipped because of robots.txt policies. We plan to analyze the crawled robots.txt files for each host to answer this.
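
The comparison described above amounts to a set difference. The following is a minimal sketch, not the pipeline used for this collection: the function name and sample URLs are hypothetical, and an in-memory set is assumed only for illustration (the real lists contain billions of entries and would need a distributed or on-disk join).

```python
def remaining_urls(extracted_links, crawled_urls):
    """Return extracted links that do not appear among the crawled URLs."""
    crawled = set(crawled_urls)
    seen = set()
    remaining = []
    # Keep first-seen order while dropping duplicates and already-crawled URLs.
    for url in extracted_links:
        if url not in crawled and url not in seen:
            seen.add(url)
            remaining.append(url)
    return remaining


# Hypothetical sample data.
extracted = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.com/logo.png",
    "http://example.org/a",
]
crawled = ["http://example.org/a"]
print(remaining_urls(extracted, crawled))
```

Grouping the result by top-level domain would then give per-TLD counts of URLs remaining to be crawled.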

URLs remaining to be crawled per Top-Level-Domain (chart data)

Definitions

Hostgraph: A directed graph whose nodes are hosts. If a page on host 'A' points to a resource on host 'B', the hostgraph contains a directed edge from 'A' to 'B'.
We considered only 'a href' links while generating the hostgraph.
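
As an illustration of this definition, the sketch below builds a hostgraph from link records. The record layout `(source_url, target_url, link_type)` and the label `"a_href"` are assumptions made for the example, and dropping links within a single host is also an assumption (the PageRank definition below refers to inter-host links only).

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased hostname of a URL, or '' if it has none."""
    return urlsplit(url).hostname or ""


def build_hostgraph(links):
    """Build a set of directed (source_host, target_host) edges.

    links: iterable of (source_url, target_url, link_type) tuples.
    Only 'a_href' links are used; embeds are skipped.
    """
    edges = set()
    for source_url, target_url, link_type in links:
        if link_type != "a_href":
            continue
        src, dst = host_of(source_url), host_of(target_url)
        # Assumption: links within the same host are not recorded as edges.
        if src and dst and src != dst:
            edges.add((src, dst))
    return edges


# Hypothetical sample records.
links = [
    ("http://a.example/p", "http://b.example/q", "a_href"),
    ("http://a.example/p", "http://b.example/r", "a_href"),
    ("http://a.example/p", "http://c.example/i.png", "embed"),
    ("http://a.example/p", "http://a.example/x", "a_href"),
]
print(build_hostgraph(links))
```

Because edges are stored in a set, several links between the same pair of hosts collapse into a single edge.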

PageRank: An implementation of Google's PageRank algorithm, run on the hostgraph. Only hosts that also point to resources on other hosts (inter-host links) are considered, and the number of links between any two hosts is ignored.
The algorithm was run for 10 iterations, producing a list of hosts and their corresponding PageRank scores.
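
For readers unfamiliar with the computation, here is a minimal power-iteration sketch over an unweighted hostgraph. It is not the implementation used for this collection: the damping factor of 0.85 and the even redistribution of rank from hosts without outgoing edges are assumptions that the text above does not specify.

```python
def pagerank(edges, iterations=10, damping=0.85):
    """Run a fixed number of PageRank iterations over directed host edges."""
    out_links = {}
    nodes = set()
    for src, dst in set(edges):  # set(): repeated links count once
        nodes.update((src, dst))
        out_links.setdefault(src, set()).add(dst)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {host: 1.0 / n for host in nodes}
    for _ in range(iterations):
        # Assumption: rank held by hosts with no outgoing edge is spread evenly.
        dangling = sum(rank[h] for h in nodes if h not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {host: base for host in nodes}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical three-host graph; the duplicate ("a", "b") edge is ignored.
scores = pagerank([("a", "b"), ("b", "a"), ("c", "a"), ("a", "b")])
for host, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(host, round(score, 4))
```

With a fixed iteration count the scores are not guaranteed to have converged, so the ordering of closely ranked hosts can still shift between iterations.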

PageRank bucket: A list of hosts ordered by PageRank score (descending) is split into a number of equal-sized buckets. If there are 'N' hosts and 'b' buckets, each bucket contains 'N/b' hosts, with bucket '1' holding the hosts with the highest scores and bucket 'b' holding the hosts with the lowest scores.
We created 50 PageRank buckets, each containing 275,950 hosts.
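
The bucketing rule can be sketched as follows. The function name is hypothetical, and two details are assumptions: ties are broken by hostname, and when the host count does not divide evenly the bucket size is rounded up so that the last bucket is the smaller one.

```python
import math


def assign_buckets(scores, num_buckets=50):
    """Map each host to a bucket number; bucket 1 holds the highest scores."""
    # Sort by descending score, then by hostname for a deterministic order.
    ordered = sorted(scores, key=lambda host: (-scores[host], host))
    size = math.ceil(len(ordered) / num_buckets)
    return {host: index // size + 1 for index, host in enumerate(ordered)}


# Hypothetical scores split into two buckets.
sample_scores = {"a.example": 0.4, "b.example": 0.3, "c.example": 0.2, "d.example": 0.1}
print(assign_buckets(sample_scores, num_buckets=2))
```

The same function applies to a rank-ordered list such as AlexaRank if the rank is negated before being passed in as the score.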

AlexaRank: Ranked list of Alexa's Top 1 Million Hosts.
The list was downloaded on September 25, 2012. 727,314 hosts from this list were crawled in the 2011 WIDE Crawl (a total of 584,374,573 captures from these hosts).

AlexaRank bucket: A list of hosts ordered by AlexaRank (highest rank to lowest rank) is split into a number of equal-sized buckets.
We created 50 AlexaRank buckets, each containing about 14,547 hosts (the 727,314 crawled AlexaRank hosts do not divide evenly by 50).

Crawl completeness: Ratio of the number of crawled resources to the total number of discovered resources. Crawl completeness = (crawled) / (crawled + notcrawled). Per-host crawl completeness is the ratio of the number of crawled resources from a given host to the total number of discovered resources from that host.
For this collection, Crawl completeness = 2,273,840,159 / 22,814,673,355 = 0.0996 (9.96%)
For just the AlexaRank hosts, Crawl completeness = 584,374,573 / 6,447,189,768 = 0.0906 (9.06%)
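
The formula can be checked against the collection-wide figures, and the per-host variant sketched, as shown below. The function names and sample URLs are hypothetical; returning 0.0 for a host with no discovered resources is an assumption made to avoid division by zero.

```python
from collections import Counter
from urllib.parse import urlsplit


def completeness(crawled, not_crawled):
    """crawled / (crawled + notcrawled); 0.0 when nothing was discovered."""
    total = crawled + not_crawled
    return crawled / total if total else 0.0


def per_host_crawl_completeness(crawled_urls, not_crawled_urls):
    """Compute crawl completeness separately for each resource host."""
    crawled = Counter(urlsplit(u).hostname for u in crawled_urls)
    missing = Counter(urlsplit(u).hostname for u in not_crawled_urls)
    # Counter returns 0 for hosts absent from one side.
    return {
        host: completeness(crawled[host], missing[host])
        for host in crawled.keys() | missing.keys()
    }


# Collection-wide figures quoted above: crawled and total discovered.
collection = completeness(2_273_840_159, 22_814_673_355 - 2_273_840_159)
print(collection)

# Hypothetical per-host example.
per_host = per_host_crawl_completeness(
    ["http://a.example/1", "http://a.example/2"],
    ["http://a.example/3", "http://a.example/4", "http://b.example/1"],
)
print(per_host)
```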

Page completeness: Ratio of the number of crawled embeds to the total number of linked embeds, where embeds are non 'a href' links. Page completeness = (crawled embeds) / (crawled embeds + notcrawled embeds). Per-host page completeness is the ratio of the number of crawled embeds linked to by a given host to the total number of embeds linked to by that host.
For this collection, Page completeness = 788,632,296 / 3,934,895,311 = 0.2004 (20.04%)
For just the AlexaRank hosts, Page completeness = 131,854,245 / 1,012,862,487 = 0.1301 (13.01%)
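
Per-host page completeness attributes each embed to the host of the page that links to it, not to the host serving the embed. A minimal sketch under that reading follows; the record layout `(page_url, embed_url)`, the function name, and the choice to count each distinct embed once per linking host are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def per_host_page_completeness(embed_links, crawled_urls):
    """Return, per linking host, the fraction of its embeds that were crawled.

    embed_links: iterable of (page_url, embed_url) pairs for non 'a href' links.
    """
    crawled = set(crawled_urls)
    embeds_by_host = defaultdict(set)
    for page_url, embed_url in embed_links:
        host = urlsplit(page_url).hostname
        if host:
            # Assumption: each distinct embed counts once per linking host.
            embeds_by_host[host].add(embed_url)
    return {
        host: sum(1 for embed in embeds if embed in crawled) / len(embeds)
        for host, embeds in embeds_by_host.items()
    }


# Hypothetical sample data.
embed_links = [
    ("http://a.example/p", "http://cdn.example/x.js"),
    ("http://a.example/p", "http://cdn.example/y.css"),
    ("http://b.example/", "http://cdn.example/x.js"),
]
crawled_embeds = ["http://cdn.example/x.js"]
page_scores = per_host_page_completeness(embed_links, crawled_embeds)
print(page_scores)
```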

Crawled Resources per PageRank bucket (chart data)

NotCrawled Resources per PageRank bucket (chart data)

Crawled Embeds per PageRank bucket (chart data)

NotCrawled Embeds per PageRank bucket (chart data)

Crawl Completeness Ratio for Hosts in PageRank buckets (chart data)

Page Completeness Ratio for Hosts in PageRank buckets (chart data)

Crawled Resources per AlexaRank bucket (chart data)

NotCrawled Resources per AlexaRank bucket (chart data)

Crawled Embeds per AlexaRank bucket (chart data)

NotCrawled Embeds per AlexaRank bucket (chart data)

Crawl Completeness Ratio for Hosts in AlexaRank buckets (chart data)

Page Completeness Ratio for Hosts in AlexaRank buckets (chart data)

Questions? Vinay Goel [vinay (at) archive (dot) org]